Abstract

The effect of errors in variables in empirical minimization is investigated. Given a loss $l$
and a set of decision rules $\mathcal{G}$, we prove a general upper bound for an empirical minimization
based on a deconvolution kernel and a noisy sample $Z_i = X_i + \epsilon_i$, $i = 1, \dots, n$.

We apply this general upper bound to give the rate of convergence for the expected
excess risk in noisy clustering. A recent bound from Levrard (2012) proves that this rate is
$O(1/n)$ in the direct case, under Pollard's regularity assumptions. Here the effect of noisy
measurements gives a rate of the form $O(1/n^{\gamma/(\gamma+2\beta)})$, where $\gamma$ is the Hölder regularity of the
density of $X$ whereas $\beta$ is the degree of ill-posedness.

Keywords: Empirical minimization, Inverse problem, Fast rates, k-means clustering
1. Introduction 

Isolating meaningful groups from the data is an interesting topic in data analysis with
applications in many fields, such as biology or social sciences. This unsupervised learning task
is known as clustering (see the early work of Hartigan (1975)). Let $X_1, \dots, X_n$ denote i.i.d.
random variables with unknown law $P$ on $\mathbb{R}^d$, with density $f$ with respect to a $\sigma$-finite
measure $\nu$. The problem of clustering is to assign to each observation a cluster among a finite
number $k$ of possible items. From a statistical viewpoint, this problem can be embedded into
the general and extensively studied problem of empirical minimization (see Vapnik (2000),
Koltchinskii (2006)) as follows. Let us consider a class of decision rules $\mathcal{G}$ and a loss function
$l : \mathcal{G} \times \mathbb{R}^d \to \mathbb{R}$, where $l(g, x)$ measures the loss of $g$ at point $x$. We aim at choosing from the data
$X_1, \dots, X_n$ a candidate $g \in \mathcal{G}$ that minimizes the risk functional:

\[ R_l(g) = \mathbb{E}\, l(g, X), \tag{1} \]

where the expectation is taken over the unknown distribution $P$. For instance, the k-means
algorithm proposes as criterion for partitioning the data the within-cluster sum of squares
$c \mapsto \min_{j=1,\dots,k} \|x - c_j\|^2$, where in the sequel $\|\cdot\|$ denotes the Euclidean norm and $c = (c_1, \dots, c_k)$
is the set of possible clusters, with corresponding decision rule $g_c(x) = \arg\min_j \|x - c_j\|$.
The performance of a given $g \in \mathcal{G}$ is measured through its non-negative excess risk, given
by:

\[ R_l(g) - R_l(g^*), \tag{2} \]

where $g^*$ is a minimizer over $\mathcal{G}$ of the risk (1).
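As a concrete illustration of the loss and decision rule described above, the following is a minimal Python sketch. It assumes NumPy, which the paper does not mention, and the function names (`kmeans_loss`, `decision_rule`, `empirical_risk`) are hypothetical. It evaluates $\min_j \|x - c_j\|^2$, the rule $g_c$, and the sample average of the loss over a direct sample.

```python
import numpy as np


def _squared_distances(centers, x):
    """Squared Euclidean distances between points and centers, shape (n, k)."""
    centers = np.atleast_2d(np.asarray(centers, dtype=float))
    x = np.atleast_2d(np.asarray(x, dtype=float))
    return ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)


def kmeans_loss(centers, x):
    """Loss min_j ||x - c_j||^2, evaluated for each row of x."""
    return _squared_distances(centers, x).min(axis=1)


def decision_rule(centers, x):
    """Decision rule g_c(x) = argmin_j ||x - c_j|| (index of the nearest center)."""
    return _squared_distances(centers, x).argmin(axis=1)


def empirical_risk(centers, sample):
    """Sample average of the loss over a direct (noise-free) sample."""
    return float(kmeans_loss(centers, sample).mean())
```

With centers $(0,0)$ and $(10,10)$ and the three points $(1,0)$, $(10,9)$, $(0,0)$, the losses are $1$, $1$ and $0$, so the averaged loss is $2/3$.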

A classical way to tackle this issue in the direct case is to consider, if it exists, the
Empirical Risk Minimizer (ERM) estimator defined as:

\[ \hat{g}_n = \arg\min_{g \in \mathcal{G}} R_n(g), \tag{3} \]

where $R_n(g)$ denotes the empirical risk defined as:

\[ R_n(g) = \frac{1}{n} \sum_{i=1}^n l(g, X_i) := P_n l(g). \]

In the sequel the empirical measure of the direct sample $X_1, \dots, X_n$ will be denoted as
$P_n$. A large literature (see Vapnik (2000) for such a generality) deals with the statistical
performance of (3) in terms of the excess risk (2). The central point of these papers is
to control the complexity of the set $\mathcal{G}$ thanks to VC dimension (Vapnik (1982)), entropy
conditions (Van De Geer (2000)), or Rademacher complexity assumptions in Bartlett et al.
(2005); Koltchinskii (2006) (see also Massart and Nédélec (2006); Blanchard et al. (2008)
in supervised classification). The main probabilistic tool for this problem is the statement
of uniform concentration of the empirical measure to the true measure. This can be easily
seen using the so-called Vapnik's bound:

\begin{align}
R_l(\hat{g}_n) - R_l(g^*) &\le R_l(\hat{g}_n) - R_n(\hat{g}_n) + R_n(g^*) - R_l(g^*) \tag{4} \\
&\le 2 \sup_{g \in \mathcal{G}} |(R_n - R_l)(g)| = 2 \sup_{g \in \mathcal{G}} |(P_n - P) l(g)|. \tag{5}
\end{align}

It is important to note that (4) can be improved using a local approach (see Massart (2000))
which consists in reducing the supremum to a neighborhood of $g^*$. We do not develop this
important refinement in this introduction for the sake of concision, although it is the main
ingredient of the literature cited above. It allows one to get fast rates of convergence.



In this paper the framework is essentially different since we observe a corrupted sample
$Z_1, \dots, Z_n$ such that:

\[ Z_i = X_i + \epsilon_i, \quad i = 1, \dots, n, \tag{6} \]

where the $\epsilon_i$'s are i.i.d. $\mathbb{R}^d$-valued random variables with density $\eta$ with respect to the Lebesgue
measure. As a result, from (6), the empirical measure $P_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i}$ is unobservable and
standard ERM (3) is not available. Unfortunately, using the corrupted sample $Z_1, \dots, Z_n$
in standard ERM (3) seems problematic since:

\[ \tilde{P}_n l(g) := \frac{1}{n} \sum_{i=1}^n l(g, Z_i) \longrightarrow \mathbb{E}\, l(g, Z) \neq R_l(g). \]

Due to the action of the convolution operator, the empirical measure of the indirect sample,
defined as $\tilde{P}_n := \frac{1}{n} \sum_{i=1}^n \delta_{Z_i}$, differs from $P_n$ and we are faced with an ill-posed inverse
problem. Note that this problem has been recently considered in Loustau and Marteau (2011)
in discriminant analysis and in a more general supervised statistical learning context in
Loustau (2011). The main idea to get optimal upper bounds is to consider an empirical
risk based on kernel deconvolution estimators.

Fast rates for Noisy Clustering

In this paper, we propose to adopt a comparable strategy in unsupervised statistical
learning. To this end, we propose to construct a kernel deconvolution estimator of the
density $f$ of the form:

\[ \hat{f}_\lambda(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{\lambda} \mathcal{K}_\eta\left(\frac{Z_i - x}{\lambda}\right), \tag{7} \]

where $\mathcal{K}_\eta$ is a deconvolution kernel and $\lambda$ is a regularization parameter (see Section 2 for
details). Given this estimator, we construct an empirical risk by plugging (7) into the true
risk $R_l(g)$ to get a so-called deconvolution empirical risk minimization given by:

\[ \arg\min_{g \in \mathcal{G}} R_n^\lambda(g), \quad \text{where } R_n^\lambda(g) := \frac{1}{n} \sum_{i=1}^n l_\lambda(g, Z_i), \tag{8} \]

whereas $l_\lambda(g, z)$ is a convolution of the loss function $l(g, \cdot)$ given by:

\[ l_\lambda(g, z) = \int \frac{1}{\lambda} \mathcal{K}_\eta\left(\frac{z - x}{\lambda}\right) l(g, x)\, \nu(dx). \]

Note that in case no such minimum exists, we can consider $\delta$-approximate minimizers as in
Bartlett and Mendelson (2006).



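In order to study the performance of a solution of (8), it is possible to use the empirical
process machinery in the spirit of Koltchinskii (2006); Bartlett and Mendelson (2006);
Blanchard et al. (2008). In the presence of indirect observations, for $\hat{g}_n^\lambda$ a solution of the
minimization of (8), we have:

\begin{align}
R_l(\hat{g}_n^\lambda) - R_l(g^*) &\le R_l(\hat{g}_n^\lambda) - R_n^\lambda(\hat{g}_n^\lambda) + R_n^\lambda(g^*) - R_l(g^*) \tag{9} \\
&\le R_l^\lambda(\hat{g}_n^\lambda) - R_n^\lambda(\hat{g}_n^\lambda) + R_n^\lambda(g^*) - R_l^\lambda(g^*) + (R_l - R_l^\lambda)(\hat{g}_n^\lambda - g^*) \tag{10} \\
&\le \sup_{g \in \mathcal{G}} |(R_l^\lambda - R_n^\lambda)(g^* - g)| + \sup_{g \in \mathcal{G}} |(R_l^\lambda - R_l)(g - g^*)|, \tag{11}
\end{align}

where in the sequel, under integrability conditions and using Fubini:

\[ R_l^\lambda(g) = \mathbb{E} R_n^\lambda(g) = \int l(g, x)\, \mathbb{E}\, \frac{1}{\lambda} \mathcal{K}_\eta\left(\frac{Z - x}{\lambda}\right) \nu(dx). \tag{12} \]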



Bounds (9)-(11) are comparable to (4)-(5) for the direct case. They consist of two terms:

• A variance term $\sup_{g \in \mathcal{G}} |(R_l^\lambda - R_n^\lambda)(g^* - g)|$ related to the estimation of $R_l^\lambda(g)$ using an
empirical counterpart. This term will be controlled using standard tools from empirical
process theory, namely a local approach in the spirit of Koltchinskii (2006). Here
the empirical process is indexed by a class of functions depending on a smoothing
parameter.

• A bias term $\sup_{g \in \mathcal{G}} |(R_l^\lambda - R_l)(g - g^*)|$ due to the estimation procedure using the kernel
deconvolution estimator. It seems to be related to the usual bias term in nonparametric
density deconvolution since we can see coarsely that:

\[ R_l^\lambda(g) - R_l(g) = \int l(g, x) \left[ \mathbb{E}\, \frac{1}{\lambda} \mathcal{K}_\eta\left(\frac{Z - x}{\lambda}\right) - f(x) \right] \nu(dx). \]






Sebastien Loustau 



The choice of $\lambda$ is crucial in the decomposition (9)-(11). We will show below that the variance
term grows when $\lambda$ tends to zero whereas the bias term vanishes. The parameter $\lambda$ has to
be chosen as a trade-off between these two terms, and as a consequence will depend on
unknown parameters. The problem of adaptation is not addressed in this paper but is an
interesting future direction.

The paper is organized as follows. In Section 2, we give a general upper
bound for (8), generalizing the results of Koltchinskii (2006) to indirect observations. Note
that all the material of Section 2 is largely inspired by Loustau (2011) and gives an
unsupervised counterpart of the previous results. Section 3 gives a direct application of
the result of Section 2 in clustering by giving rates of convergence for a new deconvolution
k-means algorithm. Fast rates of convergence are proposed which generalize the recent fast
rates proposed in Levrard (2012) in the direct case.



2. General Upper bound 

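In this section we propose an upper bound for the expected excess risk of the estimator:

\[ \hat{g}_n^\lambda := \arg\min_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n l_\lambda(g, Z_i), \tag{13} \]

where $l_\lambda(g, z)$ is constructed as follows.

Let us introduce $\mathcal{K} = \prod_{j=1}^d \mathcal{K}_j : \mathbb{R}^d \to \mathbb{R}$, a $d$-dimensional function defined as the product
of $d$ unidimensional functions $\mathcal{K}_j$. Then if we denote by $\lambda = (\lambda_1, \dots, \lambda_d)$ a set of (positive)
bandwidths and by $\mathcal{F}[\cdot]$ the Fourier transform, we define $\mathcal{K}_\eta$ as:

\[ \mathcal{K}_\eta : \mathbb{R}^d \to \mathbb{R}, \quad t \mapsto \mathcal{K}_\eta(t) = \mathcal{F}^{-1}\left[ \frac{\mathcal{F}[\mathcal{K}](\cdot)}{\mathcal{F}[\eta](\cdot/\lambda)} \right](t). \tag{14} \]

Moreover in the sequel we restrict the study to a compact set $K \subset \mathbb{R}^d$ and define $l_\lambda(g, z)$ as

\[ l_\lambda(g, z) = \int_K \frac{1}{\lambda} \mathcal{K}_\eta\left(\frac{z - x}{\lambda}\right) l(g, x)\, dx, \]

where we write with a slight abuse of notation $\frac{1}{\lambda} \mathcal{K}_\eta\left(\frac{z - x}{\lambda}\right)$ for
$\prod_{i=1}^d \frac{1}{\lambda_i}\, \mathcal{K}_\eta\left(\frac{z_1 - x_1}{\lambda_1}, \dots, \frac{z_d - x_d}{\lambda_d}\right)$.

The restriction to a compact $K$ allows one to control the variance in decomposition (9) thanks
to Lemma 1 (we refer the reader to Loustau (2011) for a discussion).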
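To make the construction (14) concrete, here is a minimal one-dimensional Python sketch under assumptions that are not in the paper: the noise is Laplace with scale `b` (so that $\mathcal{F}[\eta](t) = 1/(1 + b^2 t^2)$) and the base kernel $\mathcal{K}$ is the standard Gaussian density. In that case the inverse Fourier transform in (14) has the closed form $\mathcal{K}(u) - (b/\lambda)^2 \mathcal{K}''(u)$. Note that the Gaussian kernel does not have a compactly supported Fourier transform, so it is used here only for illustration; the function names are hypothetical.

```python
import numpy as np


def laplace_deconvolution_kernel(u, b, lam):
    """Deconvolution kernel for Laplace(0, b) noise and a Gaussian base kernel.

    Dividing F[K](t) by F[eta](t / lam) = 1 / (1 + (b / lam)^2 t^2) multiplies
    F[K](t) by (1 + (b / lam)^2 t^2), whose inverse Fourier transform is
    K(u) - (b / lam)^2 K''(u), with K''(u) = (u^2 - 1) K(u) for the Gaussian.
    """
    phi = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return phi * (1.0 - (b / lam) ** 2 * (u**2 - 1.0))


def deconvolution_density(x_grid, z, b, lam):
    """Kernel deconvolution density estimate on x_grid from the noisy sample z."""
    # u[m, i] = (Z_i - x_m) / lam, then average over the sample.
    u = (z[None, :] - x_grid[:, None]) / lam
    return laplace_deconvolution_kernel(u, b, lam).mean(axis=1) / lam
```

With `b = 0` the kernel reduces to the Gaussian density, which recovers an ordinary kernel density estimate of the direct case.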

Finally in the sequel for the sake of simplicity we restrict ourselves to moderately ill-posed
inverse problems and introduce the following assumption:

Noise Assumption (NA): There exists $(\beta_1, \dots, \beta_d)' \in \mathbb{R}_+^d$ such that for all $i \in \{1, \dots, d\}$,

\[ |\mathcal{F}[\eta_i](t)| \sim |t|^{-\beta_i} \quad \text{as } t \to +\infty. \]

Moreover, we assume that $\mathcal{F}[\eta_i](t) \neq 0$ for all $t \in \mathbb{R}$ and $i \in \{1, \dots, d\}$.

Assumption (NA) deals with the asymptotic behavior of the characteristic function of the
noise distribution. These kinds of restrictions are standard in deconvolution problems (see
Fan (1991); Meister (2009); Butucea (2007)). Note that straightforward modifications
allow one to consider severely ill-posed inverse problems, where the
characteristic function of $\epsilon$ decreases exponentially to zero.
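Two textbook noise distributions, given here only as an illustration and not taken from the paper, show the difference between the moderately and the severely ill-posed cases. The scale parameters $b$ and $\sigma$ are notation introduced for this example.

```latex
% Laplace noise with scale b > 0: polynomial decay, so (NA) holds with beta = 2
% (up to the multiplicative constant b^{-2}).
\[
  \eta(x) = \frac{1}{2b}\, e^{-|x|/b}, \qquad
  \mathcal{F}[\eta](t) = \frac{1}{1 + b^2 t^2} \sim b^{-2} |t|^{-2}
  \quad \text{as } t \to +\infty .
\]
% Gaussian noise with variance sigma^2: exponential decay, which corresponds
% to the severely ill-posed case and is not covered by (NA).
\[
  \eta(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^2/(2\sigma^2)}, \qquad
  \mathcal{F}[\eta](t) = e^{-\sigma^2 t^2 / 2} .
\]
```

In both cases the characteristic function never vanishes, as required by the second part of the assumption.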

Under (NA), the goal is to control the two terms of (9)-(11), namely the bias term and the
variance term. The variance term is reduced to the study of the increments of the empirical
process:

\[ \nu_n^\lambda(g) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \left( l_\lambda(g, Z_i) - \mathbb{E}\, l_\lambda(g, Z) \right). \]

It will be controlled thanks to a version of Talagrand's inequality due to Bousquet (2002).
However it is important to note that here the empirical process is indexed by the class of
functions $\{z \mapsto l_\lambda(g, z), g \in \mathcal{G}\}$, which depends on a regularization parameter $\lambda \in \mathbb{R}_+^d$. This
parameter will be calibrated as a function of $n$, so Talagrand's type inequality has to be
used in a careful way. For this purpose, we need the following lemma.

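Lemma 1 Suppose $f \ge c_0 > 0$ on $K$. Then if (NA) holds, and $\mathcal{K}$ has compactly supported
Fourier transform, we have:

(i) $l(g) \mapsto l_\lambda(g)$ is Lipschitz with respect to $\lambda$:

\[ \exists C_1 > 0 : \forall g, g' \in \mathcal{G}, \quad \|l_\lambda(g) - l_\lambda(g')\|_{L_2(\tilde{P})} \le C_1 \prod_{i=1}^d \lambda_i^{-\beta_i} \|l(g) - l(g')\|_{L_2(P)}. \]

(ii) $\{l_\lambda(g), g \in \mathcal{G}\}$ is uniformly bounded:

\[ \exists C_2 > 0 : \sup_{g \in \mathcal{G}} \|l_\lambda(g, \cdot)\|_\infty \le C_2 \prod_{i=1}^d \lambda_i^{-\beta_i - 1/2}. \]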



The proof of this result is presented in Loustau (2011) in a slightly different framework.
Note that the assumption on $f$ to be strictly positive on $K$ appears for some technical
reasons in the proof and could be avoided in some cases (see the discussion in Loustau
(2011)).

The Lipschitz property (i) is a key ingredient to control the complexity of the class of
functions $\{l_\lambda(g), g \in \mathcal{G}\}$ thanks to standard complexity arguments applied to the loss class
$\{l(g), g \in \mathcal{G}\}$. Finally (ii) is necessary to apply Talagrand's type inequality to the empirical
process $g \mapsto \nu_n^\lambda(g)$ above.

To control the excess risk of the procedure, we also need to control the bias term defined
in (9)-(11) thanks to Lemma 2 below.

Lemma 2 Suppose $f \in \Sigma(\gamma, L)$, the Hölder class of $\lfloor\gamma\rfloor$-fold continuously differentiable
functions on $\mathbb{R}^d$ satisfying the Hölder condition. Let $\mathcal{K}$ be a kernel of order $\lfloor\gamma\rfloor$ with respect
to $\nu$. Then if $l(g, \cdot) \in L_1(\nu, \mathbb{R}^d)$, we have:

\[ \forall g, g' \in \mathcal{G}, \quad |(R_l - R_l^\lambda)(g - g')| \le C \sum_{i=1}^d \lambda_i^\gamma. \]

The proof is presented in Loustau (2011) and is omitted. Finally to state fast rates, we also
require an additional assumption over the distribution $P$.

Definition 2.1 We say that $\mathcal{F}$ is a Bernstein class with respect to $P$ if there exists $\kappa_0 > 0$
such that for every $f \in \mathcal{F}$:

\[ \|f\|^2_{L_2(P)} \le \kappa_0\, \mathbb{E}_P f. \]

This notion of Bernstein class first appears in Bartlett and Mendelson (2006) in a more
general form. Definition 2.1 corresponds to the ideal case where $\kappa = 1$. This assumption
can be related to the well-known margin assumption in supervised classification, introduced
in Mammen and Tsybakov (1999). Section 3 proposes an unsupervised version of this
hypothesis.

From a technical viewpoint, this requirement arises naturally in the proof when we want to
apply a functional Bernstein's inequality such as Talagrand's inequality. If we consider the loss
class $\mathcal{F} = \{l(g) - l(g^*), g \in \mathcal{G}\}$, Definition 2.1 gives a perfect variance-risk correspondence.
We are now ready to state the main result of this section.

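Theorem 3 Suppose (NA) holds and the assumptions of Lemmas 1-2 hold. Suppose $\{l(g) -
l(g^*), g \in \mathcal{G}\}$ is Bernstein w.r.t. $P$, where $g^* \in \arg\min_{\mathcal{G}} R_l(g)$ is unique, and there exists
$0 < \rho < 1$ such that for every $\delta > 0$:

\[ \mathbb{E} \sup_{g, g' \in \mathcal{G}(\delta)} \left| (\tilde{P} - \tilde{P}_n)(l_\lambda(g) - l_\lambda(g')) \right| \lesssim \frac{\prod_{i=1}^d \lambda_i^{-\beta_i}}{\sqrt{n}}\, \delta^{\frac{1-\rho}{2}}, \tag{15} \]

where $\mathcal{G}(\delta) = \{g \in \mathcal{G} : R_l(g) - R_l(g^*) \le \delta\}$.

Then the estimator $\hat{g} = \hat{g}_n^\lambda$ defined in (13) is such that:

\[ \mathbb{E} R_l(\hat{g}) - R_l(g^*) \le C n^{-\frac{\gamma}{\gamma(1+\rho) + 2\beta}}, \]

where $\beta = \sum_{i=1}^d \beta_i$ and $\lambda = (\lambda_1, \dots, \lambda_d)$ is chosen as:

\[ \lambda_i \sim n^{-\frac{1}{\gamma(1+\rho) + 2\beta}}, \quad \forall i = 1, \dots, d. \]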



The proof of this result iterates a version of Talagrand's inequality due to Bousquet (2002).
It is presented in Section 5. Coarsely speaking, Lemma 1 together with the complexity
assumption (15) leads to a control of the variance term in decomposition (9)-(11). Then Lemma
2 gives the order of the bias term. The choice of $\lambda$ made explicit in Theorem 3 trades off these
two terms, and gives the excess risk bound.
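The trade-off can be sketched heuristically as follows, in dimension $d = 1$, ignoring constants and the lower-order terms of Talagrand's inequality. This is only a back-of-the-envelope reading of assumption (15) and Lemma 2, not the actual proof; the names $\delta_{\mathrm{var}}$ and $\delta_{\mathrm{bias}}$ are introduced here for convenience.

```latex
\begin{align*}
\text{variance: }& \delta \asymp \frac{\lambda^{-\beta}}{\sqrt{n}}\,\delta^{\frac{1-\rho}{2}}
  \;\Longrightarrow\;
  \delta_{\mathrm{var}} \asymp \Big(\frac{\lambda^{-2\beta}}{n}\Big)^{\frac{1}{1+\rho}}, \\
\text{bias: }& \delta_{\mathrm{bias}} \asymp \lambda^{\gamma}, \\
\text{balance: }& \lambda^{\gamma(1+\rho)} \asymp \frac{\lambda^{-2\beta}}{n}
  \;\Longrightarrow\;
  \lambda \asymp n^{-\frac{1}{\gamma(1+\rho)+2\beta}},
  \qquad
  \lambda^{\gamma} \asymp n^{-\frac{\gamma}{\gamma(1+\rho)+2\beta}} .
\end{align*}
```

The first line is the fixed point of the modulus of continuity in (15); equating it with the bias order $\lambda^\gamma$ gives the bandwidth and the rate.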

Note that the rates of convergence in Theorem 3 generalize previous results. When
$\epsilon = 0$, Theorem 3 gives fast rates of convergence between $O(1/\sqrt{n})$ and $O(1/n)$ depending on
the complexity parameter $\rho > 0$ in (15), which can be related to entropy or Rademacher
complexities of the hypothesis set $\mathcal{G}$. The effect of the inverse problem depends on the
asymptotic behavior of the characteristic function of the noise distribution $\epsilon$. This is rather
standard in the statistical inverse problem literature (Fan (1991) or Meister (2009)).

Moreover the control of the modulus of continuity in (15) is specific to the indirect
framework and depends on the smoothing parameter $\lambda$. A comparable hypothesis arises
in the direct case in Koltchinskii (2006), up to the constant depending on $\lambda$. It appears
that it will be satisfied in our application using standard statistical learning arguments, such
as maximal inequalities and chaining.

Finally note that the complexity parameter involved in assumption (15) is smaller than
the complexity proposed in Koltchinskii (2006) or Loustau (2011). Here the supremum
is taken over the set $\{g, g' \in \mathcal{G}(\delta)\} \subset \{g, g' \in \mathcal{G} : P(l(g) - l(g'))^2 \le c\delta\}$ provided that
$\{l(g) - l(g^*), g \in \mathcal{G}\}$ is Bernstein with respect to $P$ according to Definition 2.1. This indexing
set is related to the localization technique used in Theorem 3, namely a localization based
on the excess risk instead of the $L_2(P)$-norm. This refinement is necessary to derive fast
rates in Section 3 (see Bartlett and Mendelson (2006) for a related discussion).



3. Application to noisy clustering 

Clustering is a basic problem in statistical learning where independent random variables
$X_1, \dots, X_n$ are observed, with common source distribution $P$. The aim is to construct
clusters to classify these data. However in many real-life situations, direct data $X_1, \dots, X_n$
are not available and measurement errors occur. Then we observe a corrupted sample
$Z_i = X_i + \epsilon_i$, $i = 1, \dots, n$, with unknown noisy distribution $\tilde{P}$. The problem of noisy clustering
is to learn clusters for the direct dataset $X_1, \dots, X_n$ when only a contaminated version
$Z_1, \dots, Z_n$ is observed.

To frame the noisy clustering problem as a statistical learning one, we first introduce
the following notation. Let $c = (c_1, \dots, c_k) \in \mathcal{C}$ denote the set of possible clusters, where
$\mathcal{C} \subseteq \mathbb{R}^{dk}$ and $X \in \mathbb{R}^d$. The loss function $\gamma$ is defined as:

\[ \gamma(c, x) = \min_{j=1,\dots,k} \|x - c_j\|^2, \]

and the corresponding true risk or clustering risk is $R(c) = \mathbb{E}_P\, \gamma(c, X)$. The performance
of the empirical minimizer $\hat{c}_n = \arg\min_{c \in \mathcal{C}} P_n \gamma(c)$ (also called the k-means clustering algorithm)
has been widely studied in the literature. Consistency was shown by Pollard (1981) when
$\mathbb{E}\|X\|^2 < \infty$, whereas Linder et al. (1994) or Biau et al. (2008) give rates of convergence of
the form $O(1/\sqrt{n})$ for the excess clustering risk defined as $R(\hat{c}_n) - R(c^*)$, where $c^* \in \mathcal{M}$,
the set of all possible optimal clusters. More recently, Levrard (2012) proposes fast rates
of the form $O(1/n)$ under Pollard's regularity assumptions. It improves a previous result
of Antos et al. (2005). The main ingredient of the proof is a localization argument in the
spirit of Blanchard et al. (2008).

In this section, we study the problem of clustering where we have at our disposal a
corrupted sample $Z_i = X_i + \epsilon_i$, $i = 1, \dots, n$, where the $\epsilon_i$'s are i.i.d. with density $\eta$ satisfying
(NA). For this purpose, we introduce the following deconvolution empirical minimization:

\[ \arg\min_{c \in \mathcal{C}} \frac{1}{n} \sum_{i=1}^n \gamma_\lambda(c, Z_i), \tag{16} \]

where $\gamma_\lambda(c, z)$ is a deconvolution cluster sum of squares defined as:

\[ \gamma_\lambda(c, z) = \int_K \frac{1}{\lambda} \mathcal{K}_\eta\left(\frac{z - x}{\lambda}\right) \min_{j=1,\dots,k} \|x - c_j\|^2\, dx, \]

for $\mathcal{K}_\eta$ the deconvolution kernel of Section 2 and $\lambda = (\lambda_1, \dots, \lambda_d) \in \mathbb{R}_+^d$ a set of positive
bandwidths chosen later on. Note that here the existence of a minimizer in (16) could be
managed as in Graf and Luschgy (2000) for the direct case. We investigate the
generalization ability of the solution of (16) in the context of Pollard's regularity assumptions,
thanks to the noisy empirical minimization results of Section 2. To this end, we will use
the following assumptions on the source distribution $P$.
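The criterion in (16) can be evaluated numerically for a given vector of centers. The sketch below is a one-dimensional illustration under assumptions that are not in the paper: Laplace noise with known scale `b`, a Gaussian base kernel (which gives the closed-form deconvolution kernel $\mathcal{K}(u) - (b/\lambda)^2 \mathcal{K}''(u)$), and a Riemann sum over a regular grid standing in for the integral over the compact set $K$. The function names are hypothetical, and the minimization over $c$ itself is not addressed.

```python
import numpy as np


def deconvolution_kernel(u, b, lam):
    """Gaussian-based deconvolution kernel for Laplace(0, b) noise (assumption)."""
    phi = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return phi * (1.0 - (b / lam) ** 2 * (u**2 - 1.0))


def noisy_clustering_risk(centers, z, b, lam, grid):
    """Riemann-sum approximation of (1/n) * sum_i gamma_lambda(c, Z_i).

    grid is a regular 1-d grid covering the compact set K; the average over
    the sample is taken first, so the risk is the integral of the
    deconvolution density estimate against the k-means loss.
    """
    centers = np.asarray(centers, dtype=float)
    dx = grid[1] - grid[0]
    # Deconvolution density estimate on the grid, shape (m,).
    u = (z[None, :] - grid[:, None]) / lam
    f_hat = deconvolution_kernel(u, b, lam).mean(axis=1) / lam
    # k-means loss min_j (x - c_j)^2 on the grid, shape (m,).
    loss = ((grid[:, None] - centers[None, :]) ** 2).min(axis=1)
    return float((f_hat * loss).sum() * dx)
```

For a two-component source concentrated around $-2$ and $2$, one would expect the criterion to be much smaller at centers $(-2, 2)$ than at poorly placed centers such as $(-0.5, 0.5)$, which is a quick sanity check on the implementation.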

Assumption 1 (Boundedness assumption): The distribution $P$ is such that:

\[ P(\mathcal{B}(0, M)) = 1, \]

where $\mathcal{B}(0, M)$ denotes the closed ball of radius $M$, with $M > 0$.

Note that (A1) imposes a boundedness condition on the random variable $X$. We will also
need the following regularity requirement, first introduced in Pollard (1982).

Assumption 2 (Pollard's regularity condition): The distribution $P$ satisfies the
following two conditions:

1. $P$ has a continuous density $f$ with respect to Lebesgue measure on $\mathbb{R}^d$,

2. The Hessian matrix of $c \mapsto P\gamma(c, \cdot)$ is positive definite for every optimal vector of
clusters $c^*$.

It is easy to see that using the compactness of $\mathcal{B}(0, M)$, (A1)-(A2) ensure that there
exists only a finite number of optimal clusters $c^* \in \mathcal{M}$. This number is denoted as $|\mathcal{M}|$ in
the rest of the paper.

Moreover, Pollard's conditions can be interpreted as follows. Denote by $\partial V_i$ the boundary of
the Voronoi cell $V_i$ associated with $c_i$, for $i = 1, \dots, k$. Then Levrard (2012) has shown
that a sufficient condition to have (A2) is to control the sup-norm of $f$ on the union of all
possible $|\mathcal{M}|$ boundaries $\partial V^{*,m} = \cup_{i=1}^k \partial V_i^{*,m}$, associated with $c_m^* \in \mathcal{M}$, as follows:

\[ \left\| f_{|\cup_{m=1}^{|\mathcal{M}|} \partial V^{*,m}} \right\|_\infty \le c(d) M^{d+1} \inf_{m=1,\dots,|\mathcal{M}|,\ i=1,\dots,k} P(V_i^{*,m}), \]

where $c(d)$ is a constant depending on the dimension $d$. As a result, (A2) is guaranteed
when the source distribution $P$ is well concentrated around its optimal clusters, which is
related to well-separated classes. From this point of view, Pollard's regularity conditions
can be related to the margin assumption in binary classification (see Tsybakov (2004)). We
have in fact the following lemma due to Antos et al. (2005).



Lemma 4 (Antos et al. (2005)) Suppose (A1)-(A2) are satisfied. Then, for any $c \in
\mathcal{B}(0, M)$:

\[ \mathrm{var}(\gamma(c, \cdot) - \gamma(c^*(c), \cdot)) \le C_1 \|c - c^*(c)\|^2 \le C_1 C_2 (R(c) - R(c^*(c))), \]

where $c^*(c) \in \arg\min_{c^*} \|c - c^*\|$.

Lemma 4 is useful to derive fast rates of convergence for two reasons.

Firstly, if we combine these two inequalities, we get a control of the variance of the excess
loss $\gamma(c) - \gamma(c^*(c))$ thanks to the excess clustering risk $R(c) - R(c^*(c))$. Note that if
$R(c) - R(c^*(c)) \le 1$, it is clear that the loss class $\{\gamma(c) - \gamma(c^*(c)), c \in \mathcal{C}\}$ is Bernstein
according to Definition 2.1 since we have coarsely:

\[ P(\gamma(c, \cdot) - \gamma(c^*(c), \cdot))^2 \le \mathrm{var}(\gamma(c, \cdot) - \gamma(c^*(c), \cdot)) + (R(c) - R(c^*(c)))^2 \le (C_1 C_2 + 1)(R(c) - R(c^*(c))). \]

Moreover the second inequality of Lemma 4 is necessary to control the complexity
involved in Section 2 thanks to the following lemma:



Lemma 5 Suppose (A1)-(A2) are satisfied. Suppose $\mathbb{E}\|\epsilon\|^2 < \infty$. Then:

\[ \mathbb{E} \sup_{(c, c^*) \in \mathcal{C} \times \mathcal{M},\ \|c - c^*\|^2 \le \delta} |\tilde{P}_n - \tilde{P}|(\gamma_\lambda(c^*, \cdot) - \gamma_\lambda(c, \cdot)) \le C\, \frac{\prod_{i=1}^d \lambda_i^{-\beta_i}}{\sqrt{n}} \sqrt{\delta}, \]

where $C$ is a positive constant depending on $M, k, d, P, \mathcal{K}, \eta$.

Note that $\mathbb{E}\|\epsilon\|^2 < \infty$ comes from Pollard (1982). Together with (A1), it gives $\mathbb{E}\|Z\|^2 < \infty$
and allows us to deal with indirect observations. The proof of Lemma 5 is presented in Section
5. It is based on Pollard (1982) extended to the noisy setting. Under (A2) and provided
that $\mathbb{E}\|\epsilon\|^2 < \infty$, we use the following approximation of the convolution loss function $\gamma_\lambda(\cdot, z)$
at any point $c \in \mathcal{C}$:

\[ \gamma_\lambda(c, z) = \gamma_\lambda(c^*, z) + \langle c - c^*, \nabla_c \gamma_\lambda(c^*, z) \rangle + \|c - c^*\| R_\lambda(c^*, c - c^*, z), \tag{17} \]

where $\nabla_c \gamma_\lambda(c^*, z)$ is the gradient of $c \mapsto \gamma_\lambda(c, z)$ at point $c^*$ and $R_\lambda(c^*, c - c^*, z)$ is a
residual term (see Pollard (1982) for details). With (17), the complexity term is controlled
with a maximal inequality due to Massart (2007), together with a chaining method.
We are now ready to state the main result of this section.

Theorem 6 Assume (NA) holds, $P$ satisfies (A1)-(A2) with density $f \in \Sigma(\gamma, L)$ and
$\mathbb{E}\|\epsilon\|^2 < \infty$. Then, denoting by $\hat{c}_n^\lambda$ a solution of (16), we have, for any $c^* \in \mathcal{M}$:

\[ \mathbb{E} R(\hat{c}_n^\lambda) - R(c^*) \le C n^{-\frac{\gamma}{\gamma + 2\beta}}, \]

where $\beta = \sum_{i=1}^d \beta_i$, $C$ is a positive constant, and $\lambda = (\lambda_1, \dots, \lambda_d)$ is chosen as:

\[ \lambda_i \sim n^{-\frac{1}{\gamma + 2\beta}}, \quad \forall i = 1, \dots, d. \]

The proof is a direct application of Section 2 when $|\mathcal{M}| = 1$, whereas when $|\mathcal{M}| \ge 2$, a more
sophisticated geometry has to be considered (see Section 5 for details). Some remarks are
in order.

The rates of convergence of Theorem 6 are fast rates when $2\beta < \gamma$. It generalizes the result
of Levrard (2012) to the errors-in-variables case since we can see coarsely that rates of
order $O(1/n)$ are reached when $\epsilon = 0$. Here the price to pay for the inverse problem is the
quantity $2\sum_{i=1}^d \beta_i$, related to the tail behavior of the characteristic function of the noise
distribution $\eta$ in (NA). This rate corresponds to the ideal case where $\rho = 0$ in Section 2,
due to the finite-dimensional structure of the set of clusters $\mathcal{C} = \{c = (c_1, \dots, c_k), c_j \in \mathbb{R}^d\}$.
An interesting extension is to consider richer classes such as kernel classes (see Mendelson)
and to deal with kernel k-means.

Lower bounds of the form $O(1/\sqrt{n})$ have been stated in the direct case by Bartlett et al.
(1998) for general distributions. An open problem is to derive optimality of Theorem 6, even
in the direct case where $\epsilon = 0$. For this purpose, we need to construct configurations where
both Pollard's regularity assumption and noise assumption (NA) could be used in a careful
way. In this direction Loustau and Marteau (2011) proposes lower bounds in a supervised
framework under both margin assumption and (NA).
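The exponents appearing in Theorems 3 and 6 are easy to tabulate. The following Python sketch is a small calculator for the rate exponent and the bandwidth order; the function names are hypothetical and constants are ignored. Setting `rho=0.0` corresponds to the clustering case of Theorem 6, and `beta=0.0` to the direct case.

```python
def excess_risk_exponent(gamma, beta, rho=0.0):
    """Exponent a such that the excess risk bound is of order n**(-a).

    gamma: Holder regularity of the density, beta: sum of the beta_i in (NA),
    rho: complexity parameter (0 < rho < 1 in Theorem 3, 0 for clustering).
    """
    if gamma <= 0 or beta < 0 or not 0 <= rho < 1:
        raise ValueError("expected gamma > 0, beta >= 0 and 0 <= rho < 1")
    return gamma / (gamma * (1.0 + rho) + 2.0 * beta)


def bandwidth_order(n, gamma, beta, rho=0.0):
    """Order of each bandwidth lambda_i, up to a constant factor."""
    if n <= 0:
        raise ValueError("expected n > 0")
    return n ** (-1.0 / (gamma * (1.0 + rho) + 2.0 * beta))
```

For example, with $\gamma = 2$ and $\beta = 1$ the exponent equals $1/2$, which is exactly the boundary $2\beta = \gamma$ below which the rate is faster than $1/\sqrt{n}$.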



4. Conclusion 



This paper can be seen as a first attempt at the study of both empirical minimization
and clustering with errors-in-variables. Many problems could be considered in future works,
from a theoretical or practical point of view.

In the problem of empirical minimization with errors-in-variables, we provide the order
of the expected excess risk, depending on the complexity of the hypothesis space, the
regularity of the direct observations and the degree of ill-posedness. For the sake of concision,
Theorem 3 only considers particular Bernstein classes and empirical minimization based on
a deconvolution kernel estimator. A higher level of generality can be derived from Loustau
(2011) but is out of the scope of the present paper.

The performance of our deconvolution k-means algorithm is obtained thanks to a
localization principle due to Koltchinskii (2006), where proofs iterate a Talagrand's inequality
due to Bousquet (2002). With such a study, Koltchinskii (2006) provides the order of the
excess risk in the direct case and allows one to recover most of the recent results in the statistical
learning context. There is good hope that many statistical learning problems dealing
with indirect observations could be solved with similar arguments.

In the problem of noisy k-means clustering, we propose fast rates of convergence of
order $O(1/n^{\gamma/(\gamma+2\beta)})$. Theorem 6 is a direct application of the result of Theorem 3 to the
problem of clustering with Pollard's regularity assumptions and bounded source. It
generalizes a recent result of Levrard (2012) where fast rates are stated for direct observations.

From a practical viewpoint, this work proposes an empirical minimization to deal with
the problem of noisy clustering. However the procedure (16) is not adaptive in many senses.
Of course the dependency on the noise distribution $\eta$ can be explored in a future work, and
can be associated with the problem of unknown operator in the statistical inverse problem
literature (see Marteau (2006) or Cavalier and Hengartner (2005)). Moreover the empirical
minimization depends on the Hölder regularity of the density of the source distribution
through the choice of the bandwidths $\lambda_i$, $i = 1, \dots, d$. However for practical experiments,
any data-dependent model selection procedure can be performed, such as cross-validation.



5. Proofs 

In all the proofs, the constant $C > 0$ may vary from line to line.

5.1 Proof of Theorem 3

The proof uses the following intermediate lemma.

Lemma 7 Suppose $\{l_\lambda(g), g \in \mathcal{G}\}$ is such that $\sup_{g \in \mathcal{G}} \|l_\lambda(g)\|_\infty \le K(\lambda)$. Define:
U^(8 3 ,t) :=K 



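\[ \phi_n^\lambda(\mathcal{G}, \delta_j) := \mathbb{E} \sup_{g, g' \in \mathcal{G}(\delta_j)} |\tilde{P}_n - \tilde{P}|[l_\lambda(g) - l_\lambda(g')], \]

\[ D^\lambda(\mathcal{G}, \delta_j) := \sup_{g, g' \in \mathcal{G}(\delta_j)} \sqrt{\tilde{P}(l_\lambda(g) - l_\lambda(g'))^2}, \]

where $\delta_j = q^{-j}$, $j \in \mathbb{N}^*$, for some $q > 0$.

Then $\forall \delta \ge \delta_n^\lambda(t)$, we have for $\hat{g} = \hat{g}_n^\lambda$ defined in (13):

\[ \mathbb{P}(R_l(\hat{g}) - R_l(g^*) \ge \delta) \le c(\delta, q) e^{-t}, \]

where: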



#(t) = (inf { 6 > : sup < 1 1 ) V ( ±q sup (R l - R*)(g - g') 

\ S 3 >5 Oj lq\ \ g,g'&g 



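The proof is a straightforward modification of the proof of Lemma 2 in Loustau (2011).
Proof [of Theorem 3] First note that, in dimension $d = 1$ for simplicity:

\[ U_n^\lambda(\delta, t) \le C \left( \phi_n^\lambda(\mathcal{G}, \delta) + \sqrt{\frac{t}{n} \phi_n^\lambda(\mathcal{G}, \delta)(1 + \lambda^{-\beta - 1/2})} + \sqrt{\frac{t}{n}} D^\lambda(\mathcal{G}, \delta) + \frac{t}{n} \right). \]

Using Definition 2.1 together with the complexity assumption over $\phi_n^\lambda(\mathcal{G}, \delta)$, we have:

\[ \phi_n^\lambda(\mathcal{G}, \delta) \le \mathbb{E} \sup_{g, g' \in \mathcal{G}(\delta)} |\tilde{P}_n - \tilde{P}|[l_\lambda(g) - l_\lambda(g')] \le C \frac{\lambda^{-\beta}}{\sqrt{n}} \delta^{\frac{1-\rho}{2}}. \]

A control of $D^\lambda(\mathcal{G}, \delta)$ using Lemma 1 together with Definition 2.1 leads to:

\[ U_n^\lambda(\delta, t) \le C \left( \frac{\lambda^{-\beta}}{\sqrt{n}} \delta^{\frac{1-\rho}{2}} + \sqrt{t}\, \frac{\lambda^{-\beta - 1/4}}{n^{3/4}} \delta^{\frac{1-\rho}{4}} + \sqrt{\frac{t}{n}} \lambda^{-\beta} \sqrt{\delta} + \frac{t}{n} \right). \]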



We hence have from an easy calculation: 



Kit) < c 
To get the result we apply Lemma 7 with:

\[ \delta = C(1 + t)\, n^{-\frac{\gamma}{\gamma(1+\rho) + 2\beta}}, \]

noting that the choice of $\lambda$ warrants that:

\[ \lambda^\gamma \le C(1 + t)\, n^{-\frac{\gamma}{\gamma(1+\rho) + 2\beta}}. \]

The same arguments conclude the proof for $d \ge 2$.



5.2 Proof of Lemma 5

The proof follows Levrard (2012) applied to the noisy setting. First note that, by smoothness
assumptions over $c \mapsto \min_j \|x - c_j\|$, we get, for any $c \in \mathcal{C}$ and $c^* \in \mathcal{M}$,

\[ \gamma_\lambda(c, z) - \gamma_\lambda(c^*, z) = \langle c - c^*, \nabla_c \gamma_\lambda(c^*, z) \rangle + \|c - c^*\| R_\lambda(c^*, c - c^*, z), \]

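where, with Pollard (1982), we have:

\[ \nabla_c \gamma_\lambda(c^*, z) = -2 \left( \int \frac{1}{\lambda} \mathcal{K}_\eta\left(\frac{z - x}{\lambda}\right)(x - c_1^*) \mathbf{1}_{V_1^*}(x)\, dx, \dots, \int \frac{1}{\lambda} \mathcal{K}_\eta\left(\frac{z - x}{\lambda}\right)(x - c_k^*) \mathbf{1}_{V_k^*}(x)\, dx \right) \]

and $R_\lambda(c^*, c - c^*, z)$ such that: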



R\(c*,c — c*,z)\ < ||c — c*|| I (c — c*, V c 7a(c*, z)) + max (| \\z — c|| — ||x — c 



j=l,...fe 



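Splitting the expectation in two parts, we obtain:

\begin{align}
\mathbb{E} \sup_{c^* \in \mathcal{M},\ \|c - c^*\|^2 \le \delta} |\tilde{P}_n - \tilde{P}|(\gamma_\lambda(c^*, \cdot) - \gamma_\lambda(c, \cdot))
&\le \mathbb{E} \sup_{c^* \in \mathcal{M},\ \|c - c^*\|^2 \le \delta} |\tilde{P}_n - \tilde{P}| \langle c^* - c, \nabla_c \gamma_\lambda(c^*, \cdot) \rangle \notag \\
&\quad + \sqrt{\delta}\, \mathbb{E} \sup_{c^* \in \mathcal{M},\ \|c - c^*\|^2 \le \delta} |\tilde{P}_n - \tilde{P}|(-R_\lambda(c^*, c - c^*, \cdot)). \tag{18}
\end{align}

To bound the first term in this decomposition, consider the random variable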

z n = (p n - p) (c* - c, v c7 a(c*, .)} = \ E - <i) £ / (^r^ " Cuj)(ix - 

1i=l jr' = l i=l u 



By a simple Hoeffding's inequality, Z n is a subgaussian random variable. Its variance can 
be bounded as follows: 



k d 



Vi\vZ„ — E £( c «,j - c l,j) 2vaT I ^ ^ ^ 
u=l j=l



A V A 
4

< -SE 
n 



< C-S 

n 



- x ic n 



Z — x 



A^VA 



iF&Kj ~ c u+,j)lV u+ ](t)\ dt 



< cuf =1 xf^-s, 

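where $u^+ = \arg\max_u \int_{V_u} \frac{1}{\lambda} \mathcal{K}_\eta\left(\frac{Z - x}{\lambda}\right)(x_j - c_{u,j})\, dx$ and $\pi_j : x \mapsto x_j$, and where we use
arguments originally stated in Loustau and Marteau (2011) for compactly supported $\mathcal{F}[\mathcal{K}]$.
We hence have, using for instance a maximal inequality due to Massart (Massart (2007), Part
6.1):

\[ \mathbb{E} \left( \sup_{c^* \in \mathcal{M},\ \|c - c^*\|^2 \le \delta} (\tilde{P}_n - \tilde{P}) \langle c^* - c, \nabla_c \gamma_\lambda(c^*, \cdot) \rangle \right) \le C\, \frac{\prod_{i=1}^d \lambda_i^{-\beta_i}}{\sqrt{n}} \sqrt{\delta}. \]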



We obtain for the first term in (18) the right order. To prove that the second term in (18)
is smaller, note that from Pollard (1982), we have:



R x (c*,c-c*,z)\ < ||c-c*|| (c-c*,V c 7a(cV)) + max (|||*-c 



< [|V cTA (c*,^)[| + ]| c - c*!!" 1 Yl 



j=l,...k 



\Z — Ca 



\z — c 



*l|2 



\Z — C 



*l|2| 



j=l,...k 



< C(HtiAT ft + 11*1 



where we use in the last line:



|V c 7a(c*,z) 



-fC r . 



z — X 



c* UJ )l v *(x)dx <CUf =1 X 



-2ft 



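Hence it is possible to apply a chaining argument as in Levrard (2012) to the class

\[ \mathcal{F} = \{R_\lambda(c^*, c - c^*, \cdot),\ c^* \in \mathcal{M},\ c \in \mathcal{C} : \|c - c^*\| \le \sqrt{\delta}\}, \]

which has an envelope function $F(\cdot) \le C(\prod_{i=1}^d \lambda_i^{-\beta_i} + \|\cdot\|) \in L_2(\tilde{P})$ provided that
$\mathbb{E}\|\epsilon\|^2 < \infty$.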



5.3 Proof of Theorem 6

The proof of Theorem 6 is divided into two steps. Using Theorem 3, we can bound the
excess risk when $|\mathcal{M}| = 1$. For the general case of a finite number of optimal clusters
$|\mathcal{M}| \ge 2$, we need to introduce a more sophisticated localization explained in Koltchinskii
(2006, Section 4).

First case: $|\mathcal{M}| = 1$.

The proof follows the proof of Theorem 3. Using the previous notations, we have:



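\[ U_n^\lambda(\delta, t) \le C \left( \phi_n^\lambda(\mathcal{C}, \delta) + \sqrt{\frac{t}{n} \phi_n^\lambda(\mathcal{C}, \delta)(1 + K(\lambda))} + \sqrt{\frac{t}{n}} D^\lambda(\mathcal{C}, \delta) + \frac{t}{n} \right). \]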



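Using Lemma 4 together with Lemma 5, we have for $d = 1$ for simplicity:

\begin{align*}
\phi_n^\lambda(\mathcal{C}, \delta) &\le \mathbb{E} \sup_{c, c' \in \mathcal{C}(\delta)} |\tilde{P}_n - \tilde{P}|[\gamma_\lambda(c) - \gamma_\lambda(c')] \\
&\le \mathbb{E} \sup_{\|c - c^*\|^2 \le C\delta} |\tilde{P}_n - \tilde{P}|[\gamma_\lambda(c) - \gamma_\lambda(c^*)] + \mathbb{E} \sup_{\|c' - c^*\|^2 \le C\delta} |\tilde{P}_n - \tilde{P}|[\gamma_\lambda(c') - \gamma_\lambda(c^*)] \\
&\le C \frac{\lambda^{-\beta}}{\sqrt{n}} \sqrt{\delta},
\end{align*}

where $c^*$ is the unique minimizer of the clustering risk. Moreover, by uniqueness of $c^*$, we
can write from Lemma 1:

\begin{align*}
D^\lambda(\mathcal{C}, \delta) &:= \sup_{c, c' \in \mathcal{C}(\delta)} \sqrt{\tilde{P}(\gamma_\lambda(c) - \gamma_\lambda(c'))^2} \\
&\le C \lambda^{-\beta} \sup_{c, c' \in \mathcal{C}(\delta)} \sqrt{P(\gamma(c) - \gamma(c'))^2} \\
&\le C \lambda^{-\beta} \left( \sup_{c \in \mathcal{C}(\delta)} \sqrt{P(\gamma(c) - \gamma(c^*))^2} + \sup_{c' \in \mathcal{C}(\delta)} \sqrt{P(\gamma(c') - \gamma(c^*))^2} \right) \\
&\le C \lambda^{-\beta} \sqrt{\delta}.
\end{align*}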



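It follows:

\[ U_n^\lambda(\delta, t) \le C \left( \frac{\lambda^{-\beta}}{\sqrt{n}} \sqrt{\delta} + \sqrt{t}\, \frac{\lambda^{-\beta - 1/4}}{n^{3/4}} \delta^{1/4} + \sqrt{\frac{t}{n}} \lambda^{-\beta} \sqrt{\delta} + \frac{t}{n} \right). \]

We hence have the result applying Lemma 7 with the choice of $\lambda$ specified in Theorem 6.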

Second case: $|\mathcal{M}| \ge 2$

When the infimum is not unique, the diameter $D^\lambda(\mathcal{G}, \delta)$ does not necessarily tend to zero
when $\delta \to 0$. We hence introduce the more sophisticated geometric characteristic $r(\sigma, \delta)$
from Koltchinskii (2006) defined as:

\[ r(\sigma, \delta) = \sup_{c \in \mathcal{C}(\delta)} \inf_{c' \in \mathcal{C}(\sigma)} \sqrt{\tilde{P}(\gamma_\lambda(c) - \gamma_\lambda(c'))^2}, \quad 0 < \sigma \le \delta. \]

It is clear that $r(\sigma, \delta) \le D^\lambda(\mathcal{G}, \delta)$ and for $\delta \to 0$, we have $r(\sigma, \delta) \to 0$. The idea of the proof
of Theorem 6 is to use a modified version of Theorem 3 using $r(\sigma, \delta)$ instead of $D^\lambda(\mathcal{G}, \delta)$.
Following Koltchinskii (2006, Theorem 4), we can use a modified version of Lemma 7 in
order to guarantee the upper bounds of Theorem 3 when $|\mathcal{M}| \ge 2$. To this end, we have to
check for $d = 1$ for simplicity:



lim E sup 



e-i-0 



sup 



geQ(a) g'&g(S):P{l(g)-l(g')) 2 <r(^S)+e 



(P-P n )(h{g)-W)) 



'n 



(19) 
and



r(a,8)\l- < C\- 
V n 

Note that from Lemma 4 and Lemma 5, it is clear that (19) holds since:



(20) 



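\begin{align*}
&\mathbb{E} \sup_{c \in \mathcal{C}(\sigma)}\ \sup_{c' \in \mathcal{C}(\delta) : P(\gamma(c) - \gamma(c'))^2 \le r(\sigma, \delta) + \epsilon} \left| (\tilde{P} - \tilde{P}_n)(\gamma_\lambda(c) - \gamma_\lambda(c')) \right| \\
&\quad \le \mathbb{E} \sup_{c \in \mathcal{C}(\sigma),\ c^* \in \mathcal{M}} \left| (\tilde{P} - \tilde{P}_n)(\gamma_\lambda(c) - \gamma_\lambda(c^*)) \right| + \mathbb{E} \sup_{c' \in \mathcal{C}(\delta)} \left| (\tilde{P} - \tilde{P}_n)(\gamma_\lambda(c') - \gamma_\lambda(c^*(c'))) \right| \\
&\quad \le 2\, \mathbb{E} \sup_{(c, c^*) \in \mathcal{C} \times \mathcal{M},\ \|c - c^*\|^2 \le C\delta} \left| (\tilde{P}_n - \tilde{P})(\gamma_\lambda(c^*) - \gamma_\lambda(c)) \right|.
\end{align*}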



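Finally (20) holds since we have with Lemma 1, $\forall c \in \mathcal{C}(\delta)$, $c' \in \mathcal{C}(\sigma)$:

\begin{align*}
\sqrt{\tilde{P}(\gamma_\lambda(c) - \gamma_\lambda(c'))^2} &\le C \lambda^{-\beta} \sqrt{P(\gamma(c) - \gamma(c'))^2} \\
&\le C \lambda^{-\beta} \left( \sqrt{P(\gamma(c) - \gamma(c^*(c)))^2} + \sqrt{P(\gamma(c') - \gamma(c^*(c)))^2} \right) \\
&\le C \lambda^{-\beta} \sqrt{\delta},
\end{align*}

provided that $\sigma \le \delta \le 1$.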